Lexicon-based Orthographic Disambiguation in CJK Intelligent Information Retrieval
نویسنده
چکیده
The orthographical complexity of Chinese, Japanese and Korean (CJK) poses a special challenge to the developers of computational linguistic tools, especially in the area of intelligent information retrieval. These difficulties are exacerbated by the lack of a standardized orthography in these languages, especially the highly irregular Japanese orthography. This paper focuses on the typology of CJK orthographic variation, provides a brief analysis of the linguistic issues, and discusses why lexical databases should play a central role in the disambiguation process.
منابع مشابه
TREC-9 CLIR Experiments at MSRCN
In TREC-9, we participated in the English-Chinese Cross-Language Information Retrieval (CLIR) track. Our work involved two aspects: finding good methods for Chinese IR, and finding effective translation means between English and Chinese. On Chinese monolingual retrieval, we investigated the use of different entities as indexes, pseudorelevance feedback, and length normalization, and examined th...
متن کاملBuilding a Dataset of Multilingual Cognates for the Romanian Lexicon
Identifying cognates is an interesting task with applications in numerous research areas, such as historical and comparative linguistics, language acquisition, cross-lingual information retrieval, readability and machine translation. We propose a dictionary-based approach to identifying cognates based on etymology and etymons. We account for relationships between languages and we extract etymol...
متن کاملComparing Canonicalizations of Historical German Text
Historical text presents numerous challenges for contemporary natural language processing techniques. In particular, the absence of consistent orthographic conventions in historical text presents difficulties for any system requiring reference to a static lexicon accessed by orthographic form. In this paper, we present three methods for associating unknown historical word forms with synchronica...
متن کاملEvaluating Resources for Query Translation in Cross-Language Information Retrieval
Our goal is to evaluate the utility of a lexical resource containing Lexical Conceptual Structures LCS for use in cross language information retrieval Our evaluation makes use of a combination of techniques from interlingual machine translation Dorr with conventional information retrieval techniques Oard OardandDorr Given a query in one language we transform the query into the corresponding ter...
متن کاملBilingual Terminology Acquisition from Comparable Corpora and Phrasal Translation to Cross-Language Information Retrieval
The present paper will seek to present an approach to bilingual lexicon extraction from non-aligned comparable corpora, phrasal translation as well as evaluations on Cross-Language Information Retrieval. A two-stages translation model is proposed for the acquisition of bilingual terminology from comparable corpora, disambiguation and selection of best translation alternatives according to their...
متن کامل